Information Extraction from Historical Semi-Structured Handwritten Documents

Authors

  • Xujun Peng
  • Huaigu Cao
  • Krishna Subramanian
  • Elizabeth Boschee
  • Rohit Prasad
  • Prem Natarajan
Abstract

In this paper, we describe our approach to extracting salient events, such as birth and death records, from historical French parish documents that contain free-form handwritten text. The challenges these documents pose to the current state of the art in handwriting recognition and information extraction go well beyond the generic challenges of recognizing handwritten text, such as style variations, irregular baselines, and poor legibility. Our approach has three processing steps: (1) pre-processing for noise removal and high-quality binarization, (2) OCR for text recognition, and (3) statistical information extraction for event record extraction. We focus on pre-processing techniques for robust binarization in the presence of the different types of degradation that are common in historical documents. We provide a detailed description of our system, experimental setup, and results for each stage of the processing. In addition, we compare different pre-processing approaches by assessing their impact on OCR performance.
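
To make the binarization step concrete, here is a minimal sketch in plain Python. The abstract does not say which binarization method the authors use, so this uses Otsu's global threshold purely as a generic stand-in. It is not the paper's technique, and the function names `otsu_threshold` and `binarize` are hypothetical.

```python
def otsu_threshold(pixels):
    """Return the gray level (0..255) that maximizes between-class variance.

    Otsu's method is an illustrative stand-in here, not the paper's method.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark = 0    # number of pixels at or below the candidate threshold
    sum_dark = 0  # sum of the gray levels of those pixels
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        if w_dark == 0:
            continue  # no pixels in the dark class yet
        w_light = total - w_dark
        if w_light == 0:
            break  # every pixel is already in the dark class
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        var_between = w_dark * w_light * (mean_dark - mean_light) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(image, threshold):
    """Map a 2-D list of gray levels to 1 (ink) for dark pixels, 0 (paper) otherwise."""
    return [[1 if p <= threshold else 0 for p in row] for row in image]


# Toy 2x3 "page": dark strokes (about 20-30) on light paper (about 200-220).
image = [[20, 200, 25], [210, 30, 220]]
threshold = otsu_threshold([p for row in image for p in row])
mask = binarize(image, threshold)
```

A single global threshold like this tends to break down on pages with stains, bleed-through, or uneven illumination, which is the kind of degradation the abstract says motivates more robust binarization.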

Similar resources

Survey on Information Extraction from Chemical Compound Literatures: Techniques and Challenges

Chemical documents, especially those involving drug information, comprise a variety of types – the most common being journal articles, patents and theses. They typically contain large amounts of chemical information, such as PubMed-ID, activity classes and adverse or side effects. Techniques are used to extract information from a huge number of documents and it is presented in a useful structur...

Web Information Extraction

Information extraction (IE) is the process of automatically extracting structured pieces of information from unstructured or semi-structured text documents. Classical problems in information extraction include named-entity recognition (identifying mentions of persons, places, organizations, etc.) and relationship extraction (identifying mentions of relationships between such named entities). We...

Populating Ontologies with Data from OCRed Lists

A flexible, accurate, and efficient method of automatically extracting facts from lists in OCRed documents and inserting them into an ontology would help make those facts machine searchable, queryable, and linkable and expose their rich ontological interrelationships. To work well, such a process must be adaptable to variations in list format, tolerant of OCR errors, and careful in its selectio...

Populating Ontologies by Semi-automatically Inducing Information Extraction Wrappers for Lists in OCRed Documents

A flexible, accurate, and efficient method of extracting facts from lists in OCRed documents and inserting them into an ontology would help make those facts machine queryable, linkable, and editable. But, to work well, such a process must be adaptable to variations in list format, tolerant of OCR errors, and careful in its selection of human guidance. We propose a wrapper-induction solution for...

Space characters in Chinese semi-structured texts

Space characters can play an important role in disambiguating text. However, few, if any, Chinese information extraction systems make full use of them, even though handling space characters appears necessary, especially when extracting information from semi-structured documents. This investigation aims to address the importance of space characters in Chinese informa...

Publication date: 2012